Pipelined Accumulators
BACKGROUND OF THE INVENTION 

Field of the Invention

The present invention pertains to the art of digital accumulators, and particularly
to hardware architectures for high-speed digital accumulators used in digital signal
processing.

Art Background 

A key element in many signal processing systems is the digital accumulator
(digital integrator). Examples include phase accumulators for Numerically Controlled
Oscillators (NCOs), and integrators such as those used in Cascaded Integrator-Comb
filters. The device is composed of a register and an adder in a feedback configuration.
As the number of symbols (one or more bits, or the equivalent of one or more bits in
non-binary systems) increases, the maximum clock rate that may be realized generally
decreases, with the primary limitation being the carry propagation requirements of the
adder. This can be a serious constraint on the achievable speed for very wide
accumulators. The feedback nature of the structure suggests that pipelining, a common
technique for increasing the speed of feed-forward structures, cannot be employed.
Designers in the past have generally employed various carry speedup methods, but the
amount of speedup achievable is limited without large increases in gate count.

What is needed is a method for speeding up digital accumulators. 

SUMMARY OF THE INVENTION 

The performance of parallel digital accumulators for use in digital signal
processing is improved through pipelining. An accumulator is partitioned into a
plurality of pipelined stages, and the pipeline delay is used to reduce the effect of
carry propagation through the accumulator. While input and output delay registers
are used in the accumulator partitions, the output delay registers are not needed if the
results of those partitions are not needed in subsequent stages of computation. If
phase coherence is not needed, input delay registers may not be needed on the
accumulator partitions. In the limiting case of one bit per partition, the effective speed
of the pipelined accumulator is equivalent to the speed of a single-bit accumulator
stage.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is described with respect to particular exemplary
embodiments thereof and reference is made to the drawings in which:

Fig. 1 shows a z-domain representation of a digital accumulator according to
the prior art,

Fig. 2 shows a digital accumulator with latency,

Fig. 3 shows a decomposed digital accumulator,

Fig. 4 shows a first pipelined digital accumulator,

Fig. 5 shows a second pipelined digital accumulator,

Fig. 6 shows a third pipelined digital accumulator, and

Fig. 7 shows a fourth pipelined digital accumulator.

DETAILED DESCRIPTION 

As the number of bits in a digital accumulator increases, the maximum clock
rate that may be realized decreases. The primary cause of this decrease is the carry
propagation time for the adder. While various carry-lookahead techniques are known
to the art, their application yields large logic structures and excessive gate counts,
especially as the number of bits in the adder increases. The z-domain representation of
a typical digital accumulator 100 is shown in Fig. 1. Register 110 provides parallel
storage for the desired width of the accumulator. The output 140 of register 110 is
one input to parallel adder 120. The other input to adder 120 is parallel input 130.
The output 150 of adder 120 is gated into register 110 by a clock (not shown).
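
The register-and-adder loop described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `Accumulator`, the method `clock`, and the modulo 2**width wrap-around (usual for phase accumulators, but not stated in the text) are all hypothetical choices made for illustration.

```python
class Accumulator:
    """Behavioral model of a register and an adder in a feedback loop."""

    def __init__(self, width):
        self.mask = (1 << width) - 1  # assumed: sums wrap modulo 2**width
        self.register = 0             # plays the role of register 110

    def clock(self, value):
        """Apply one clock edge; return the register output seen before the edge."""
        output = self.register
        # The adder sums the register output and the parallel input,
        # and the result is gated back into the register.
        self.register = (self.register + value) & self.mask
        return output


acc = Accumulator(width=8)
outputs = [acc.clock(100) for _ in range(4)]
print(outputs)
```

In hardware, the addition inside `clock` must complete within one clock period, which is why the carry propagation time of a wide adder limits the clock rate.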

According to the present invention, the pipelined accumulator is derived by
first considering the accumulator of Fig. 2. The indicated width of 48 bits is arbitrary
and used as an example only. Register 200 introduces a one clock period latency.

The 48-bit accumulator of Fig. 2 is decomposed as shown in Fig. 3 into an upper
accumulator 310 handling the 24 most significant bits (MSBs), with input 312 and
output 314, and a lower accumulator 320 handling the 24 least significant bits (LSBs),
with input 322 and output 324. The partitioning into 24-bit sections is arbitrary; any
partition is allowable. MSB inputs 312 are delayed by input delay 316, and LSB inputs
322 are delayed by input delay 326, prior to being passed to their respective adders.

Note that in this decomposition, the only communication between upper 
accumulator 310 and lower accumulator 320 is the single-bit carry signal 330. 
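
As a rough check of this observation, the following sketch performs a 48-bit addition as two 24-bit additions linked only by a single carry bit. The function name `split_add` and the constants are hypothetical, and operands are assumed to fit in 48 bits.

```python
HALF = 24
HALF_MASK = (1 << HALF) - 1
FULL_MASK = (1 << (2 * HALF)) - 1


def split_add(acc, value):
    """Add two 48-bit words using two 24-bit adders and a one-bit carry."""
    lower_sum = (acc & HALF_MASK) + (value & HALF_MASK)
    carry = lower_sum >> HALF  # 0 or 1: the only signal passed upward
    upper_sum = (acc >> HALF) + (value >> HALF) + carry
    # The carry out of the upper adder is discarded (wrap modulo 2**48).
    return ((upper_sum & HALF_MASK) << HALF) | (lower_sum & HALF_MASK)


a, b = 0x0000_00FF_FFFF, 0x0000_0000_0001
assert split_add(a, b) == (a + b) & FULL_MASK
```

Because the two halves interact through that one bit alone, a register placed on the carry line is enough to separate the two adders in time.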

According to the present invention, as shown in Fig. 4, input delay 326 of Fig.
3 is moved through the block to its two outputs. Delay 430 delays carry output 432,
which becomes carry input 434 to upper accumulator 410, and delay 440 delays lower
accumulator output 424 to provide delayed output 444.



This restructuring according to the present invention has not changed the
input-output relationship from that shown in Fig. 2. However, register 430 has broken
the carry chain, and the overall system can now produce a full 48-bit parallel result
operating at the rate of a 24-bit adder, where in Fig. 2 the overall system was limited
by the speed of the single 48-bit adder.
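
The claim that the input-output relationship is unchanged can be checked with a cycle-level sketch. The function names `reference` and `pipelined` are hypothetical, the one-period latency of the reference is assumed to sit at the input, and outputs are sampled just after each clock edge.

```python
HALF = 24
HALF_MASK = (1 << HALF) - 1
FULL_MASK = (1 << (2 * HALF)) - 1


def reference(samples):
    """48-bit accumulator with one period of added latency (assumed at the input)."""
    delayed = acc = 0
    out = []
    for x in samples:
        acc, delayed = (acc + delayed) & FULL_MASK, x
        out.append(acc)
    return out


def pipelined(samples):
    """Two 24-bit accumulators joined by a registered carry."""
    upper_in = upper = carry = lower = lower_out = 0
    out = []
    for x in samples:
        # Combinational logic: each adder is only 24 bits wide.
        lower_sum = lower + (x & HALF_MASK)
        upper_sum = upper + upper_in + carry
        # All registers update on the same clock edge.
        upper_in = x >> HALF           # input delay on the upper half
        upper = upper_sum & HALF_MASK
        carry = lower_sum >> HALF      # carry register breaks the carry chain
        lower_out = lower              # output delay on the lower half
        lower = lower_sum & HALF_MASK
        out.append((upper << HALF) | lower_out)
    return out


samples = [0x0000_00FF_FFFF, 0x0000_0000_0001, 0x1234_5678_9ABC, 0xFFFF_FFFF_FFFF]
assert pipelined(samples) == reference(samples)
```

In this model no adder is wider than 24 bits, yet the 48-bit output sequence matches the reference sample for sample.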

While Fig. 4 shows a decomposition into two partitions of 24 bits, this process
of decomposition can continue, with additional latencies added. For example,
decomposition may be done in multiple 4-bit or 8-bit partitions. At the limit of this
decomposition, shown in Fig. 5, each partition contains only a one-bit full adder and a
one-bit register. Ultimately, a pipelined accumulator according to the present
invention can operate at a speed limited only by this basic element. The number of
partitions employed determines the amount of latency required.

Fig. 5 shows an accumulator of N partitions numbered 0 to N-1. The total
latency through the accumulator, and through each partition, is N periods. While Fig.
5 shows each partition as a single bit, each partition may represent a plurality of bits.
Parallel inputs are applied to input lines 110 (most significant), 210, 310, 410, and
510 (least significant). Parallel outputs are presented at output lines 160 (most
significant), 260, 360, 460, and 560 (least significant).

Examining the most significant partition 100 of Fig. 5, input 110 feeds input
delay register 120, which delays input 110 for (N-1) periods and feeds adder 130.
The output of adder 130 feeds output register 140, which provides a one period delay.
Since input delay register 120 provides a delay of (N-1) periods, and output register
140 provides a delay of 1 period, no additional output delay is needed to present an N
period total delay to output 160. Carry input 170 to adder 130 is provided by register
180. While most significant partition 100 is shown as a single bit, it may equally well
represent multiple bits, or a nonbinary representation.

Next-most significant partition 200 has input 210 feeding input delay register
220, which provides (N-2) periods of delay and feeds adder 230. The output of adder
230 feeds output register 240, which provides a one period delay. Since a total of N
periods of delay is required, output delay register 250 is needed to provide 1 unit of
delay to output signal 260. Carry output 290 from adder 230 feeds register 180,
providing the carry input signal to adder 130. Carry input 270 for adder 230 comes
from register 280, which in turn is driven by the carry output signal of the next less
significant partition.

Turning now to the least significant partitions 300, 400, 500 of Fig. 5, the
overall requirement is for a latency of N periods between input and output. Least
significant partition 500 has zero input delay, and no input delay register. Output
register 540 provides 1 period of delay, so output delay register 550 is needed to
provide (N-1) periods of delay to output signal 560. Next partition 400 has input
delay 420 of 1 period and output delay 450 of (N-2) periods. Partition 300 has input
delay 320 of 2 periods and output delay 350 of (N-3) periods. The carry chain begins
with carry out signal 510 from adder 530, which is held and delayed by register 480,
feeding carry in 470 of adder 430.
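
The delay pattern described for Fig. 5 can be summarized in a general model: counting partitions from the least significant end as k = 0, partition k has k periods of input delay, one period in its accumulator register, and (N-1-k) periods of output delay. The sketch below is a behavioral illustration only; the names `pipelined_accumulate` and `reference`, and the indexing from the least significant partition, are assumptions made here and are not taken from the figures.

```python
from collections import deque


def pipelined_accumulate(samples, n_parts, bits):
    """Simulate n_parts partitions of `bits` bits each, with registered carries."""
    mask = (1 << bits) - 1
    # Delay lines pre-filled with zeros: input delay k, output delay n_parts-1-k.
    in_delay = [deque([0] * k) for k in range(n_parts)]
    out_delay = [deque([0] * (n_parts - 1 - k)) for k in range(n_parts)]
    acc = [0] * n_parts
    carry = [0] * n_parts  # carry[k] is the registered carry into partition k
    results = []
    for x in samples:
        new_acc, new_carry = acc[:], carry[:]
        word = 0
        for k in range(n_parts):
            in_delay[k].append((x >> (k * bits)) & mask)
            operand = in_delay[k].popleft()
            total = acc[k] + operand + carry[k]  # short adder, `bits` wide
            new_acc[k] = total & mask
            if k + 1 < n_parts:
                new_carry[k + 1] = total >> bits  # registered, used next period
            out_delay[k].append(new_acc[k])
            word |= out_delay[k].popleft() << (k * bits)
        acc, carry = new_acc, new_carry
        results.append(word)
    return results


def reference(samples, n_parts, bits):
    """Running sum modulo 2**(n_parts*bits), aligned to the pipeline latency."""
    full_mask = (1 << (n_parts * bits)) - 1
    return [sum(samples[:max(0, n - n_parts + 2)]) & full_mask
            for n in range(len(samples))]


samples = [0x1234, 0xFFFF, 0x0001, 0x8000, 0x7FFF, 0x00FF, 0xF00F, 0x0F0F]
assert pipelined_accumulate(samples, 4, 4) == reference(samples, 4, 4)
assert pipelined_accumulate(samples, 8, 2) == reference(samples, 8, 2)
```

The input skew aligns each slice of an input word with the carry generated by the slice below it, and the output delays realign the partial results into one parallel word.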

The input delay registers 120, 220, 320, 420 of Fig. 5 are needed to maintain
phase coherence in applications such as numerically controlled oscillators (NCOs).
Where such phase coherence may be sacrificed, for example when an accumulator
according to the present invention is used in a frequency-hopping system, these
registers may be eliminated, as is shown in Fig. 6.
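
One way to see the effect of removing the input delay registers is to simulate a frequency hop with and without them. The sketch below is an illustrative model, not a description of Fig. 6 itself: the function `run`, its `input_delays` flag, and the choice of four 4-bit partitions are assumptions. In this model the steady-state phase increment is the same in both cases, but the accumulated phase differs.

```python
def run(samples, n_parts, bits, input_delays):
    """Pipelined accumulator; partition k (0 = least significant) optionally
    sees its input slice k periods late. Returns the realigned output words."""
    mask = (1 << bits) - 1
    acc = [0] * n_parts
    carry = [0] * n_parts
    history = [[] for _ in range(n_parts)]  # register value after each edge
    for m in range(len(samples)):
        next_carry = [0] * n_parts
        for k in range(n_parts):
            src = m - k if input_delays else m
            operand = (samples[src] >> (k * bits)) & mask if src >= 0 else 0
            total = acc[k] + operand + carry[k]
            acc[k] = total & mask
            if k + 1 < n_parts:
                next_carry[k + 1] = total >> bits
            history[k].append(acc[k])
        carry = next_carry
    outputs = []
    for n in range(n_parts - 1, len(samples)):
        word = 0
        for k in range(n_parts):
            # Output delay of (n_parts - 1 - k) periods realigns the partitions.
            word |= history[k][n - (n_parts - 1 - k)] << (k * bits)
        outputs.append(word)
    return outputs


# Hypothetical frequency hop: the increment changes from 0x1234 to 0x4321.
hop = [0x1234] * 12 + [0x4321] * 12
coherent = run(hop, 4, 4, input_delays=True)
skewed = run(hop, 4, 4, input_delays=False)

step = lambda seq: (seq[-1] - seq[-2]) & 0xFFFF
print(hex(step(coherent)), hex(step(skewed)))  # same increment after settling
print(coherent[-1] == skewed[-1])              # accumulated phase is not the same
```

Without the input skew, each partition applies a new increment at a different point relative to the carries it receives, so the phase after the hop is offset from that of the coherent design.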

Fig. 7 shows an implementation of the present invention where the LSB data is
not required. While 24 bits are shown in each of the upper 710 and lower 720
partitions, this split is arbitrary and shown as an example only. In the case where the
lower partition bits are not needed as part of the accumulator output, output latency
compensation registers are not needed for lower partition 720. This implementation
may be used when, for example, a value with a fixed number of fractional bits is being
accumulated, but only the integer portion of the accumulated result is needed.
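
The fixed-point case can be sketched as follows, assuming 24 fractional bits in the lower partition and 24 integer bits in the upper partition. The function name `integer_part_stream` and the example increment of 1.25 are hypothetical. The lower partition keeps its accumulator register and carry register but has no output delay register, since its bits never leave the accumulator.

```python
FRAC_BITS = 24
INT_BITS = 24
FRAC_MASK = (1 << FRAC_BITS) - 1
INT_MASK = (1 << INT_BITS) - 1


def integer_part_stream(samples):
    """Accumulate fixed-point samples; emit only the integer (upper) partition."""
    upper_in = upper = carry = lower = 0
    out = []
    for x in samples:
        lower_sum = lower + (x & FRAC_MASK)
        upper_sum = upper + upper_in + carry
        # Register updates on the clock edge.
        upper_in = x >> FRAC_BITS
        upper = upper_sum & INT_MASK
        carry = lower_sum >> FRAC_BITS  # fractional overflow feeds the integer part
        lower = lower_sum & FRAC_MASK   # kept internally, never output
        out.append(upper)
    return out


# Increment of 1.25 in fixed point with 24 fractional bits (illustrative value).
step = (1 << FRAC_BITS) + (1 << (FRAC_BITS - 2))
print(integer_part_stream([step] * 10))
```

The lower partition influences the output only through its carry, so the integer result still reflects the full-precision sum.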

The foregoing detailed description of the present invention is provided for the
purpose of illustration and is not intended to be exhaustive or to limit the invention to
the precise embodiments disclosed. Accordingly the scope of the present invention is
defined by the appended claims.